Portfolio 1

In this portfolio we analyse cycling data from the social network site Strava and from the GoldenCheetah application.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sn
import calendar
from matplotlib import pyplot as plt
from datetime import timedelta
plt.style.use('seaborn')
%matplotlib inline

Importing all the necessary libraries and Python packages used in the analysis.

Analysis of Cycling Data

Loading Data


The first dataset is an export of my ride data from Strava, an online social network site for cycling and other sports. This data is a log of every ride since the start of 2018 and contains summary data like the distance and average speed. It was exported using the script stravaget.py which uses the stravalib module to read data. Some details of the fields exported by that script can be seen in the documentation for stravalib.

The exported data is a CSV file so that's easy to read, however the date information in the file is recorded in a different timezone (UTC) so we need to do a bit of conversion. In reading the data I'm setting the index of the data frame to be the datetime of the ride.

In [2]:
strava = pd.read_csv('data/strava_export.csv', index_col='date', parse_dates=True)
strava.index = strava.index.tz_convert('UTC')
print("Shape of the Strava Dataframe is:" , strava.shape)
strava.head()
Shape of the Strava Dataframe is: (268, 10)
Out[2]:
average_heartrate average_temp average_watts device_watts distance elapsed_time elevation_gain kudos moving_time workout_type
date
2018-01-02 20:47:51+00:00 100.6 21.0 73.8 False 15.2 94 316.00 m 10 73 Ride
2018-01-04 01:36:53+00:00 NaN 24.0 131.7 False 18.0 52 236.00 m 5 46 Ride
2018-01-04 02:56:00+00:00 83.1 25.0 13.8 False 0.0 3 0.00 m 2 2 Ride
2018-01-04 05:37:04+00:00 110.1 24.0 113.6 False 22.9 77 246.00 m 8 64 Ride
2018-01-05 19:22:46+00:00 110.9 20.0 147.7 True 58.4 189 676.00 m 12 144 Ride

The second dataset comes from an application called GoldenCheetah which provides some analytics services over ride data. This has some of the same fields but adds a lot of analysis of the power, speed and heart rate data in each ride. This data overlaps with the Strava data but doesn't include all of the same rides.

Again we create an index using the datetime for each ride, this time combining two columns in the data (date and time) and localising to Sydney so that the times match those for the Strava data.

In [3]:
cheetah = pd.read_csv('data/cheetah.csv', skipinitialspace=True)
cheetah.index = pd.to_datetime(cheetah['date'] + ' ' + cheetah['time'])
cheetah.index = cheetah.index.tz_localize('Australia/Sydney')
print("Shape of the cheetah dataframe is :", cheetah.shape)
cheetah.head()
Shape of the cheetah dataframe is : (251, 362)
Out[3]:
date time filename axPower aPower Relative Intensity aBikeScore Skiba aVI aPower Response Index aIsoPower aIF ... Rest AVNN Rest SDNN Rest rMSSD Rest PNN50 Rest LF Rest HF HRV Recovery Points NP IF TSS
2018-01-28 06:39:49+11:00 01/28/18 06:39:49 2018_01_28_06_39_49.json 202.211 0.75452 16.6520 1.31920 1.67755 223.621 0.83441 ... 0 0 0 0 0 0 0 222.856 0.83155 20.2257
2018-01-28 07:01:32+11:00 01/28/18 07:01:32 2018_01_28_07_01_32.json 226.039 0.84343 80.2669 1.21137 1.54250 246.185 0.91860 ... 0 0 0 0 0 0 0 245.365 0.91554 94.5787
2018-02-01 08:13:34+11:00 02/01/18 08:13:34 2018_02_01_08_13_34.json 0.000 0.00000 0.0000 0.00000 0.00000 0.000 0.00000 ... 0 0 0 0 0 0 0 0.000 0.00000 0.0000
2018-02-06 08:06:42+11:00 02/06/18 08:06:42 2018_02_06_08_06_42.json 221.672 0.82714 78.8866 1.35775 1.86002 254.409 0.94929 ... 0 0 0 0 0 0 0 253.702 0.94665 98.3269
2018-02-07 17:59:05+11:00 02/07/18 17:59:05 2018_02_07_17_59_05.json 218.211 0.81422 159.4590 1.47188 1.74658 233.780 0.87231 ... 0 0 0 0 0 0 0 232.644 0.86808 171.0780

5 rows × 362 columns

The GoldenCheetah data contains a great many variables (columns) and I won't go into all of them here. Here are definitions of some of the more important fields for the analysis below. Capitalised fields come from the GoldenCheetah data while lowercase_fields come from Strava. Many fields are duplicated between the two sources; in those cases the values should be the same, although there is room for variation since each tool may use a different algorithm to calculate them.

  • Duration - overall duration of the ride, should be same as elapsed_time
  • Time Moving - time spent moving (not resting or waiting at lights), should be the same as moving_time
  • Elevation Gain - metres climbed over the ride
  • Average Speed - over the ride
  • Average Power - average power in watts as measured by a power meter; relates to how much effort is being put into the ride, should be the same as average_watts from Strava
  • Nonzero Average Power - same as Average Power but excludes times when power is zero from the average
  • Average Heart Rate - should be the same as average_heartrate
  • Average Cadence - cadence is the rotations per minute of the pedals
  • Average Temp - temperature in the environment as measured by the bike computer (should be same as average_temp)
  • VAM - average ascent speed - speed up hills
  • Calories (HR) - calorie expenditure as estimated from heart rate data
  • 1 sec Peak Power - this and other 'Peak Power' measures give the maximum power output in the ride over this time period. Will be higher for shorter periods. High values in short periods would come from a very 'punchy' ride with sprints for example.
  • 1 min Peak Hr - a similar measure relating to Heart Rate
  • NP - Normalised Power, a smoothed average power measurement, generally higher than Average Power
  • TSS - Training Stress Score, a measure of how hard a ride this was
  • device_watts - True if the power (watts) measures were from a power meter, False if they were estimated
  • distance - distance travelled in Km
  • kudos - likes from other Strava users (social network)
  • workout_type - one of 'Race', 'Workout' or 'Ride'

Some of the GoldenCheetah parameters are defined in their documentation.
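Since several fields are duplicated between the two sources, a quick consistency check can be useful. The sketch below is illustrative only (the helper `compare_field` and the column pairing are my own, not part of the export scripts): it joins the two frames on their datetime index and reports the mean absolute difference for one duplicated field.

```python
import pandas as pd

# Illustrative helper (not from the original scripts): compare a field that
# appears in both exports by joining on the datetime index.
def compare_field(left: pd.DataFrame, right: pd.DataFrame,
                  left_col: str, right_col: str) -> float:
    """Mean absolute difference between two supposedly duplicated columns."""
    joined = left[[left_col]].join(right[[right_col]], how='inner')
    return float((joined[left_col] - joined[right_col]).abs().mean())

# e.g. compare_field(strava, cheetah, 'average_heartrate', 'Average Heart Rate')
# should come out near zero if both tools agree.
```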

We begin the analysis by joining the Strava and GoldenCheetah data:

In [4]:
str_cth_join=strava.join(cheetah,how='inner')
print("Shape of the dataframe after joining strava & cheetah file is:",str_cth_join.shape)
str_cth_join.head()
Shape of the dataframe after joining strava & cheetah file is: (243, 372)
Out[4]:
average_heartrate average_temp average_watts device_watts distance elapsed_time elevation_gain kudos moving_time workout_type ... Rest AVNN Rest SDNN Rest rMSSD Rest PNN50 Rest LF Rest HF HRV Recovery Points NP IF TSS
2018-01-27 19:39:49+00:00 120.6 21.0 153.4 True 7.6 17 95.00 m 4 17 Ride ... 0 0 0 0 0 0 0 222.856 0.83155 20.2257
2018-01-27 20:01:32+00:00 146.9 22.0 187.7 True 38.6 67 449.00 m 19 67 Race ... 0 0 0 0 0 0 0 245.365 0.91554 94.5787
2018-01-31 21:13:34+00:00 109.8 19.0 143.0 False 26.3 649 612.00 m 6 113 Ride ... 0 0 0 0 0 0 0 0.000 0.00000 0.0000
2018-02-05 21:06:42+00:00 119.3 19.0 165.9 True 24.3 69 439.00 m 6 65 Ride ... 0 0 0 0 0 0 0 253.702 0.94665 98.3269
2018-02-07 06:59:05+00:00 124.8 20.0 151.0 True 47.1 144 890.00 m 10 134 Ride ... 0 0 0 0 0 0 0 232.644 0.86808 171.0780

5 rows × 372 columns

Analysis


We performed an inner join between the two datasets, strava and cheetah, and stored the result in the dataframe str_cth_join. The inner join leaves 243 rows and 372 columns.
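One detail worth noting: the Strava index is in UTC while the cheetah index is localised to Sydney, yet the join still lines up, because pandas compares tz-aware timestamps by absolute instant. A minimal sketch:

```python
import pandas as pd

# Two labels for the same instant: 06:39:49 AEDT (UTC+11) is 19:39:49 UTC
# the previous day, exactly the row pairing visible in the joined output.
utc = pd.Timestamp('2018-01-27 19:39:49', tz='UTC')
syd = pd.Timestamp('2018-01-28 06:39:49', tz='Australia/Sydney')
print(utc == syd)  # True: tz-aware timestamps compare by absolute instant
```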

In [5]:
result=str_cth_join[(str_cth_join['device_watts']==True)]
result=result.dropna()
print("Shape of the dataframe after removing rides and na values is:", result.shape)
Shape of the dataframe after removing rides and na values is: (144, 372)

Analysis


Rides with no measured power are removed: device_watts = True implies the power was measured by a power meter, whereas False implies it was estimated. We keep only the rows with measured power and also drop rows containing NaN values.

In [6]:
pd.options.mode.chained_assignment = None
result['elevation_gain'] = pd.to_numeric(result['elevation_gain'].str.split().str[0])

Analysis


We convert elevation_gain (recorded as strings like '316.00 m') to numeric form so that the variable can be plotted; the chained-assignment warning is suppressed first.

In [7]:
sn.distplot(result['distance'])
plt.show()

Analysis


As the distribution plot above shows, the variable distance is bimodal, i.e., its distribution has two peaks.

In [8]:
sn.distplot(result['moving_time'])
plt.show()

Analysis


As the distribution plot above shows, the variable moving_time is also bimodal.

In [9]:
sn.distplot(result['Average Speed'])
plt.show()

Analysis


As the distribution plot above shows, the variable Average Speed is left skewed: the mean is less than the median and the tail extends to the left.

In [10]:
sn.distplot(result['Average Power'])
plt.show()

Analysis


As the distribution plot above shows, the variable Average Power is approximately normally distributed.

In [11]:
sn.distplot(result['TSS'])
plt.show()

Analysis


As the distribution plot above shows, the variable TSS is right skewed.
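The visual judgements above (bimodal, left or right skewed) can be backed with a number. The sketch below assumes `result` is the filtered data frame from above; `report_skew` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Hypothetical helper: pandas' sample skewness of each named column.
# skew > 0: right-skewed, skew < 0: left-skewed, near 0: roughly symmetric.
def report_skew(df: pd.DataFrame, cols: list) -> pd.Series:
    return df[cols].skew()

# In the notebook this would be called as, e.g.:
# report_skew(result, ['distance', 'moving_time', 'Average Speed',
#                      'Average Power', 'TSS'])
```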

In [12]:
corr_result =result[["distance", "moving_time","Average Speed","Average Heart Rate","Average Power","NP","TSS","elevation_gain"]]
sn.pairplot(corr_result,diag_kind = 'kde',plot_kws = {'alpha': 0.4, 's': 40, 'edgecolor': 'k'})
Out[12]:
<seaborn.axisgrid.PairGrid at 0x169699006d8>

Analysis


The pairplot shows the pairwise relationships between the variables of interest: distance, moving_time, Average Speed, Average Power, Average Heart Rate, NP (Normalised Power), TSS and elevation_gain.

In [13]:
corr_result=result[['distance','Time Moving','Average Speed','NP','TSS','elevation_gain','Average Heart Rate','Average Power']]
corr_mtx_result=corr_result.corr()
corr_mtx_result
Out[13]:
distance Time Moving Average Speed NP TSS elevation_gain Average Heart Rate Average Power
distance 1.000000 0.968110 0.034581 0.156948 0.909636 0.869634 0.057835 -0.010553
Time Moving 0.968110 1.000000 -0.192527 0.031393 0.897472 0.882298 -0.114745 -0.169354
Average Speed 0.034581 -0.192527 1.000000 0.536113 -0.030817 -0.089907 0.757362 0.731292
NP 0.156948 0.031393 0.536113 1.000000 0.397483 0.205713 0.596401 0.841682
TSS 0.909636 0.897472 -0.030817 0.397483 1.000000 0.872069 0.053536 0.136925
elevation_gain 0.869634 0.882298 -0.089907 0.205713 0.872069 1.000000 0.040688 -0.031666
Average Heart Rate 0.057835 -0.114745 0.757362 0.596401 0.053536 0.040688 1.000000 0.758055
Average Power -0.010553 -0.169354 0.731292 0.841682 0.136925 -0.031666 0.758055 1.000000

Analysis


corr_mtx_result is the correlation matrix between the variables selected in the previous cell. The larger the absolute value, the stronger the relationship between the two variables; negative values indicate an inverse relationship.

Highly correlated:

Distance with Time Moving, Distance with TSS, distance with Elevation Gain.

Weakly correlated:

Distance with Average Speed, Distance with Normalized Power, Distance with Average Heart Rate and Distance with Average Power.

Negatively correlated:

Time Moving and Average Power, Average Speed and Time Moving, Average Speed and Elevation Gain etc.
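These pairings can also be extracted programmatically rather than read off by eye. The helper below (`strong_pairs`, my own illustrative function) scans the upper triangle of a correlation matrix for coefficients above a threshold:

```python
import pandas as pd

# Illustrative helper: list (row, col, r) pairs with |r| >= threshold,
# taken from the upper triangle of a correlation matrix.
def strong_pairs(corr: pd.DataFrame, threshold: float = 0.85):
    pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return pairs

# e.g. strong_pairs(corr_mtx_result) would list pairs such as
# ('distance', 'Time Moving', 0.968)
```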

In [14]:
wride=result[(result['workout_type']=='Ride')]
print("Shape of dataframe with workout type only Rides is ", wride.shape)

wrace=result[(result['workout_type']=='Race')]
print("Shape of dataframe with workout type only Race is ", wrace.shape)

wworkout=result[(result['workout_type']=='Workout')]
print("Shape of dataframe with workout type only workout is ", wworkout.shape)
Shape of dataframe with workout type only Rides is  (113, 372)
Shape of dataframe with workout type only Race is  (26, 372)
Shape of dataframe with workout type only workout is  (5, 372)

Analysis


The code above splits the result dataframe into three subsets: wride containing only Ride workouts, wrace containing only Race records, and wworkout containing only Workout records.

In [15]:
plt.scatter(wride["distance"],wride["Elevation Gain"],color='green', label="Ride")
plt.scatter(wrace["distance"],wrace["Elevation Gain"],color='red', label="Race")
plt.scatter(wworkout["distance"],wworkout["Elevation Gain"],color='blue',label="Workout")
plt.xlabel("Distance")
plt.ylabel("Elevation Gain")
plt.legend()
plt.show()

Analysis


The scatter plot above relates distance to Elevation Gain for Ride, Race and Workout. Distance is how far the rider travelled and elevation gain is how many metres were climbed during the ride. We draw the following conclusions from the scatter plot:

  1. For Ride, elevation gain increases steadily as distance increases.
  2. For Race, elevation gain increases with distance, but only up to a certain level.
  3. For Workout, elevation gain remains roughly constant as distance increases, suggesting that workout-type rides focus on covering distance rather than climbing.
In [16]:
plt.scatter(wride["elapsed_time"],wride["TSS"],color='black', label="Ride")
plt.scatter(wrace["elapsed_time"],wrace["TSS"],color='orange', label="Race")
plt.scatter(wworkout["elapsed_time"],wworkout["TSS"],color='yellow',label="Workout")
plt.xlabel("Elapsed Time")
plt.ylabel("Training Stress Score")
plt.legend()
plt.show()

Analysis


The scatter plot above relates elapsed time to TSS for Ride, Race and Workout. Elapsed time is the total time taken for the ride, and TSS (Training Stress Score) measures how hard the ride was. We draw the following conclusions:

  1. For Ride, TSS increases steadily as elapsed_time increases.
  2. For Race, TSS also increases with elapsed_time, but only up to a certain level.
  3. For Workout, TSS increases with elapsed_time but less than for Ride and Race, suggesting workout-type rides are less strenuous than rides and races.
In [17]:
plt.scatter(wride["TSS"],wride["Calories (HR)"],color='blue', label="Ride")
plt.scatter(wrace["TSS"],wrace["Calories (HR)"],color='red', label="Race")
plt.scatter(wworkout["TSS"],wworkout["Calories (HR)"],color='yellow',label="Workout")
plt.xlabel("TSS")
plt.ylabel("Calories (HR)")
plt.legend()
plt.show()

Analysis


The scatter plot above relates TSS to Calories (HR) for Ride, Race and Workout. Calories (HR) is the calorie expenditure estimated from heart rate data, and TSS (Training Stress Score) measures how hard the ride was. We draw the following conclusions:

  1. For Ride, Calories (HR) increases steadily with TSS, which is expected since a harder ride burns more calories.
  2. For Race, Calories (HR) also increases with TSS, but only up to a certain level, with some roughly constant stretches.
  3. For Workout, riders burn fewer calories at a given TSS than for the Ride and Race types.
In [18]:
distdf=pd.concat([wride['Distance'],wrace['Distance'],wworkout['Distance']], axis=1, keys=['Ride', 'Race','WorkOut'])
distdf.boxplot()
plt.ylabel("Distance")
plt.show()

Analysis


From the above boxplots, we can draw the following conclusions for Distance:

  1. For Ride type, the values range from around 0 to beyond 100 (the whiskers extend from around 0 to around 112) and there are no outliers outside this range.
  2. For Race type, the boxplot shows several outliers (points falling outside the whiskers) and the whiskers extend much less than those of the Ride boxplot.
  3. For Workout type, the spread is quite small and there is only one outlier, compared with three for the Race type.
In [19]:
distdf=pd.concat([wride['Calories (HR)'],wrace['Calories (HR)'],wworkout['Calories (HR)']], axis=1, keys=['Ride', 'Race','WorkOut'])
distdf.boxplot()
plt.ylabel("Calories (HR)")
plt.show()

Analysis


From the above boxplots, we can draw the following conclusions for Calories (HR):

  1. For Ride type, the values range from 0 to beyond 3000 (the whiskers extend from 0 to beyond 3000) and there are no outliers outside this range.
  2. For Race type, the boxplot shows several outliers (points falling outside the whiskers) and the whiskers extend much less than those of the Ride boxplot.
  3. For Workout type, the spread is quite similar to the Race type but there are no outliers.
In [20]:
distdf=pd.concat([wride['TSS'],wrace['TSS'],wworkout['TSS']], axis=1, keys=['Ride', 'Race','WorkOut'])
distdf.boxplot()
plt.ylabel("TSS")
plt.show()

Analysis


From the above boxplots, we can draw the following conclusions for TSS:

  1. For Ride type, the values range from 0 to beyond 280 (the whiskers extend from 0 to beyond 280) and there are two outliers outside this range.
  2. For Race type, the spread is relatively smaller than for Ride, but there are no values outside the whisker range.
  3. For Workout type, the median is broadly similar to the Race type.
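The outlier counts read off the boxplots follow the usual whisker rule (points beyond 1.5 × IQR from the quartiles). A small sketch of that rule, with `count_outliers` being an illustrative helper rather than part of the notebook:

```python
import pandas as pd

# Illustrative helper: count points beyond the 1.5 * IQR whiskers.
def count_outliers(s: pd.Series) -> int:
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((s < lo) | (s > hi)).sum())

# e.g. count_outliers(wrace['TSS']) would reproduce the outlier counts
# read off the boxplots above.
```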

CHALLENGE ANALYSIS

In [21]:
kudos_rel=[result[result.columns[1:]].corr()['kudos'][:]] # result keeps only rides with measured power (device_watts == True)
kudos_rel_df = pd.DataFrame(kudos_rel)
kudos_rel_df=kudos_rel_df.loc[:,kudos_rel_df.gt(0.7).any()]
kudos_rel_df
Out[21]:
distance kudos Distance Max Core Temperature Aerobic TISS Distance Swim TRIMP Points TRIMP(100) Points TRIMP Zonal Points
kudos 0.722458 1.0 0.722582 0.73212 0.719443 0.722582 0.740524 0.740524 0.722062

Analysis


From the output above, we can conclude that the following factors are associated with more kudos (likes):

distance (from the Strava file), Distance (from GoldenCheetah), Max Core Temperature, Aerobic TISS, Distance Swim, TRIMP Points, TRIMP(100) Points and TRIMP Zonal Points each have a correlation above 0.7 with kudos.

In [22]:
fig,axs=plt.subplots(3,1,figsize=(8,20))

#Kudos relationship with distance
axs[0].scatter(result['kudos'], result['Distance'], color='red')
axs[0].set_xlabel('kudos' ,size =20)
axs[0].set_ylabel('Distance',size =20)
axs[0].set_title('Distance with Kudos',size =20)

#Kudos relationship with Elevation Gain
axs[1].scatter(result['kudos'], result['Elevation Gain'], color='black')
axs[1].set_xlabel('kudos',size =20)
axs[1].set_ylabel('Elevation Gain',size =20)
axs[1].set_title('Elevation Gain with Kudos',size =20)

#Kudos relationship with NP
axs[2].scatter(result['kudos'], result['NP'], color='green')
axs[2].set_xlabel('kudos',size =20)
axs[2].set_ylabel('NP',size =20)
axs[2].set_title('NP with Kudos',size =20)
plt.show()

Analysis


The scatter plots above depict the relationship between kudos and the main variables Distance, Elevation Gain and NP. We can summarise them as follows:

1) The first scatter plot, kudos against Distance, shows that kudos increases as distance increases, i.e., the two are roughly proportional.

2) The second scatter plot, kudos against Elevation Gain, shows a mixed trend: elevation gain sometimes increases while kudos remains constant, and vice versa.

3) The last scatter plot, kudos against NP, shows that the NP values start from a high value rather than from 0.

In [23]:
month_name=[]
for e in result.index:
    month_name.append((calendar.month_name[e.month]))
result.insert(2, "Month", month_name, True) 
month_data_df=pd.DataFrame(result.groupby(['Month']).distance.sum())
month_data_df['TSS']=result.groupby(['Month']).TSS.sum()
month_data_df['Average Speed']=result.groupby(['Month'])['Average Speed'].mean()
month_data_df
Out[23]:
distance TSS Average Speed
Month
April 782.4 2274.6801 24.918635
August 127.5 522.1341 25.169500
December 400.4 1106.8375 27.075000
February 700.8 1966.2923 26.016750
January 390.8 1083.7889 26.182430
July 461.5 1241.1682 25.500000
June 560.2 1617.3223 27.086453
March 871.1 2762.8343 26.268610
May 598.3 1702.5335 25.351987
November 531.0 1431.8949 25.324731
October 203.5 587.3830 22.893875
September 123.2 347.4977 27.286367

Analysis


The code above aggregates the distance travelled, TSS and average speed by month. It shows that the most distance was travelled in March, which also has the highest total TSS, whereas September has the highest recorded average speed. The two bar graphs below depict this analysis in graphical form for easier comparison.
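Note that groupby sorts the month names alphabetically, which is why the table above runs April, August, December, and so on. A small sketch (the helper `chronological` is my own, not part of the notebook) reindexes the monthly table into calendar order before plotting:

```python
import calendar
import pandas as pd

# Illustrative helper: reorder a month-indexed frame chronologically.
def chronological(monthly: pd.DataFrame) -> pd.DataFrame:
    order = list(calendar.month_name)[1:]          # 'January' ... 'December'
    return monthly.reindex([m for m in order if m in monthly.index])

# e.g. month_data_df = chronological(month_data_df) before the bar charts
```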

In [24]:
month_data_df.plot( y=["distance", "Average Speed", "TSS"], kind="bar")
plt.rcParams["figure.figsize"] = (8,10)
plt.legend(loc='best', prop={'size': 8})
plt.show()
In [25]:
month_data_df.plot( y=["Average Speed"], kind="bar", color="c")
plt.rcParams["figure.figsize"] = (9,11)
plt.legend(loc='best', prop={'size':10})
plt.show()

Conclusion

We can conclude that March has the highest total distance travelled by the cyclist, and the total TSS is also highest in that month, whereas September has the lowest distance travelled.

Average speed is highest in September.

Portfolio 2

Data driven prediction models of energy use of appliances in a low-energy house.

In [1]:
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.3f' % x)
from pandas import DataFrame
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
import calendar
from matplotlib import pyplot as plt
from datetime import timedelta
from datetime import time
from datetime import date
from datetime import datetime
plt.style.use('seaborn')
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

Loading Data


The first dataset is the complete record of the energy consumption data. We also have two further datasets, training and testing, which are subsets of the complete data. We will train a model on one set and test it on the other to check the model's correctness.

We change the data type of the date column to datetime for the analysis.

In [2]:
complete_data = pd.read_csv('data/energydata_complete.csv',parse_dates=True)
complete_data["date"]=pd.to_datetime(complete_data["date"])
complete_data.head()
Out[2]:
date Appliances lights T1 RH_1 T2 RH_2 T3 RH_3 T4 ... T9 RH_9 T_out Press_mm_hg RH_out Windspeed Visibility Tdewpoint rv1 rv2
0 2016-01-11 17:00:00 60 30 19.890 47.597 19.200 44.790 19.790 44.730 19.000 ... 17.033 45.530 6.600 733.500 92.000 7.000 63.000 5.300 13.275 13.275
1 2016-01-11 17:10:00 60 30 19.890 46.693 19.200 44.722 19.790 44.790 19.000 ... 17.067 45.560 6.483 733.600 92.000 6.667 59.167 5.200 18.606 18.606
2 2016-01-11 17:20:00 50 30 19.890 46.300 19.200 44.627 19.790 44.933 18.927 ... 17.000 45.500 6.367 733.700 92.000 6.333 55.333 5.100 28.643 28.643
3 2016-01-11 17:30:00 50 40 19.890 46.067 19.200 44.590 19.790 45.000 18.890 ... 17.000 45.400 6.250 733.800 92.000 6.000 51.500 5.000 45.410 45.410
4 2016-01-11 17:40:00 60 40 19.890 46.333 19.200 44.530 19.790 45.000 18.890 ... 17.000 45.400 6.133 733.900 92.000 5.667 47.667 4.900 10.084 10.084

5 rows × 29 columns

Below are descriptions of the variables used in the dataset.


  1. date time : year-month-day hour:minute:second
  2. Appliances : energy use in Wh
  3. lights : energy use of light fixtures in the house in Wh
  4. T1 : Temperature in kitchen area in Celsius
  5. RH_1 : Humidity in kitchen area in %
  6. T2 : Temperature in living room area in Celsius
  7. RH_2 : Humidity in living room area in %
  8. T3 : Temperature in laundry room area
  9. RH_3 : Humidity in laundry room area in %
  10. T4 : Temperature in office room in Celsius
  11. RH_4 : Humidity in office room in %
  12. T5 : Temperature in bathroom in Celsius
  13. RH_5 : Humidity in bathroom in %
  14. T6 : Temperature outside the building (north side) in Celsius
  15. RH_6 : Humidity outside the building (north side) in %
  16. T7 : Temperature in ironing room in Celsius
  17. RH_7 : Humidity in ironing room in %
  18. T8 : Temperature in teenager room 2 in Celsius
  19. RH_8 : Humidity in teenager room 2 in %
  20. T9 : Temperature in parents room in Celsius
  21. RH_9 : Humidity in parents room in %
  22. T_out : Temperature outside (from Chièvres weather station) in Celsius
  23. Pressure (from Chièvres weather station) in mm Hg
  24. RH_out : Humidity outside (from Chièvres weather station)in %
  25. Windspeed (from Chièvres weather station) in m/s
  26. Visibility (from Chièvres weather station) in km
  27. Tdewpoint (from Chièvres weather station) °C
  28. rv1 : Random variable 1 nondimensional
  29. rv2 : Random variable 2 nondimensional
In [3]:
complete_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19735 entries, 0 to 19734
Data columns (total 29 columns):
date           19735 non-null datetime64[ns]
Appliances     19735 non-null int64
lights         19735 non-null int64
T1             19735 non-null float64
RH_1           19735 non-null float64
T2             19735 non-null float64
RH_2           19735 non-null float64
T3             19735 non-null float64
RH_3           19735 non-null float64
T4             19735 non-null float64
RH_4           19735 non-null float64
T5             19735 non-null float64
RH_5           19735 non-null float64
T6             19735 non-null float64
RH_6           19735 non-null float64
T7             19735 non-null float64
RH_7           19735 non-null float64
T8             19735 non-null float64
RH_8           19735 non-null float64
T9             19735 non-null float64
RH_9           19735 non-null float64
T_out          19735 non-null float64
Press_mm_hg    19735 non-null float64
RH_out         19735 non-null float64
Windspeed      19735 non-null float64
Visibility     19735 non-null float64
Tdewpoint      19735 non-null float64
rv1            19735 non-null float64
rv2            19735 non-null float64
dtypes: datetime64[ns](1), float64(26), int64(2)
memory usage: 4.4 MB

Analysis


We check the data types of all variables in the dataset: the date column is datetime type, Appliances and lights are integer type, and all other variables are float type.

In [4]:
sns.heatmap(complete_data.isnull(),yticklabels=False,cbar=False,cmap='YlGnBu')
complete_data.isnull().sum()
Out[4]:
date           0
Appliances     0
lights         0
T1             0
RH_1           0
T2             0
RH_2           0
T3             0
RH_3           0
T4             0
RH_4           0
T5             0
RH_5           0
T6             0
RH_6           0
T7             0
RH_7           0
T8             0
RH_8           0
T9             0
RH_9           0
T_out          0
Press_mm_hg    0
RH_out         0
Windspeed      0
Visibility     0
Tdewpoint      0
rv1            0
rv2            0
dtype: int64

Analysis


We check for null values in the dataset and find that there are none.

In [5]:
complete_data.describe()
Out[5]:
Appliances lights T1 RH_1 T2 RH_2 T3 RH_3 T4 RH_4 ... T9 RH_9 T_out Press_mm_hg RH_out Windspeed Visibility Tdewpoint rv1 rv2
count 19735.000 19735.000 19735.000 19735.000 19735.000 19735.000 19735.000 19735.000 19735.000 19735.000 ... 19735.000 19735.000 19735.000 19735.000 19735.000 19735.000 19735.000 19735.000 19735.000 19735.000
mean 97.695 3.802 21.687 40.260 20.341 40.420 22.268 39.243 20.855 39.027 ... 19.486 41.552 7.412 755.523 79.750 4.040 38.331 3.761 24.988 24.988
std 102.525 7.936 1.606 3.979 2.193 4.070 2.006 3.255 2.043 4.341 ... 2.015 4.151 5.317 7.399 14.901 2.451 11.795 4.195 14.497 14.497
min 10.000 0.000 16.790 27.023 16.100 20.463 17.200 28.767 15.100 27.660 ... 14.890 29.167 -5.000 729.300 24.000 0.000 1.000 -6.600 0.005 0.005
25% 50.000 0.000 20.760 37.333 18.790 37.900 20.790 36.900 19.530 35.530 ... 18.000 38.500 3.667 750.933 70.333 2.000 29.000 0.900 12.498 12.498
50% 60.000 0.000 21.600 39.657 20.000 40.500 22.100 38.530 20.667 38.400 ... 19.390 40.900 6.917 756.100 83.667 3.667 40.000 3.433 24.898 24.898
75% 100.000 0.000 22.600 43.067 21.500 43.260 23.290 41.760 22.100 42.157 ... 20.600 44.338 10.408 760.933 91.667 5.500 40.000 6.567 37.584 37.584
max 1080.000 70.000 26.260 63.360 29.857 56.027 29.236 50.163 26.200 51.090 ... 24.500 53.327 26.100 772.300 100.000 14.000 66.000 15.500 49.997 49.997

8 rows × 28 columns

Analysis


We examine summary statistics of the dataset such as the count, mean, standard deviation and quantiles.

Analysis begins on energy data

In [6]:
date_appliances_data=complete_data[["date","Appliances"]]
date_appliances_data.plot(kind='line',x='date', y='Appliances',color='brown')
plt.xlabel("Time" ,size =20)
plt.ylabel("Appliances Wh" , size =20)
plt.subplots_adjust(right=3)

#1st week data
date_appliances_data[1:1008].plot(kind='line',x='date', y='Appliances',color='grey')
plt.xlabel("Time 1 week",size =20)
plt.ylabel("Appliances Wh",size =20)
plt.subplots_adjust(right=3)

Analysis


Two line graphs are plotted: the first shows appliance energy use against date over the whole period, while the second shows one week of usage. The highest energy consumption occurred between the 15th and 16th of January.
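The `[1:1008]` slice corresponds to roughly one week because the readings are logged every 10 minutes:

```python
# One reading every 10 minutes: 6 per hour, 144 per day, 1008 per week,
# which is why the slice [1:1008] above covers about the first week.
samples_per_hour = 60 // 10
rows_per_day = 24 * samples_per_hour
rows_per_week = 7 * rows_per_day
print(rows_per_day, rows_per_week)  # 144 1008
```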

In [7]:
date_appliances_data['Appliances'].plot.hist(bins=40,grid=True,color='green') 
plt.subplots_adjust(right=2)
plt.xlim([0,1200])
plt.xlabel("Appliances Wh")
plt.ylim(0,10000)
plt.ylabel("Frequency")
Out[7]:
Text(0, 0.5, 'Frequency')

Analysis


The histogram shows the distribution of appliance usage, with frequency on the y-axis (limited here to 10,000). The distribution is right skewed, since the tail extends to the right.

In [8]:
ax = sns.boxplot(x=date_appliances_data["Appliances"])
plt.xlabel("Appliances Wh")
plt.subplots_adjust(right=3)
plt.show()

Analysis

As the boxplot above shows, there are many outliers in the data, meaning that many values fall outside the whisker range.
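"Many outliers" can be quantified with the same 1.5 × IQR whisker rule the boxplot uses; `outlier_share` below is an illustrative helper, not part of the original notebook.

```python
import pandas as pd

# Illustrative helper: share of readings beyond the boxplot's upper
# whisker (Q3 + 1.5 * IQR).
def outlier_share(s: pd.Series) -> float:
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    hi = q3 + 1.5 * (q3 - q1)
    return float((s > hi).mean())

# e.g. outlier_share(date_appliances_data['Appliances'])
```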

Energy complete has been divided into training and testing dataset for the further analysis.

In [9]:
train_data = pd.read_csv('data/energydata_training.csv' ,index_col='date')
print ("Shape of the training dataset is : ",train_data.shape)
train_data.head()
Shape of the training dataset is :  (14803, 31)
Out[9]:
Appliances lights T1 RH_1 T2 RH_2 T3 RH_3 T4 RH_4 ... Press_mm_hg RH_out Windspeed Visibility Tdewpoint rv1 rv2 NSM WeekStatus Day_of_week
date
2016-01-11 17:00:00 60 30 19.890 47.597 19.200 44.790 19.790 44.730 19.000 45.567 ... 733.500 92.000 7.000 63.000 5.300 13.275 13.275 61200 Weekday Monday
2016-01-11 17:10:00 60 30 19.890 46.693 19.200 44.722 19.790 44.790 19.000 45.992 ... 733.600 92.000 6.667 59.167 5.200 18.606 18.606 61800 Weekday Monday
2016-01-11 17:20:00 50 30 19.890 46.300 19.200 44.627 19.790 44.933 18.927 45.890 ... 733.700 92.000 6.333 55.333 5.100 28.643 28.643 62400 Weekday Monday
2016-01-11 17:40:00 60 40 19.890 46.333 19.200 44.530 19.790 45.000 18.890 45.530 ... 733.900 92.000 5.667 47.667 4.900 10.084 10.084 63600 Weekday Monday
2016-01-11 17:50:00 50 40 19.890 46.027 19.200 44.500 19.790 44.933 18.890 45.730 ... 734.000 92.000 5.333 43.833 4.800 44.919 44.919 64200 Weekday Monday

5 rows × 31 columns

Analysis


We read the training dataset; its shape is 14803 x 31 (i.e., 14803 rows and 31 columns).

In [10]:
train_data.describe()
Out[10]:
Appliances lights T1 RH_1 T2 RH_2 T3 RH_3 T4 RH_4 ... RH_9 T_out Press_mm_hg RH_out Windspeed Visibility Tdewpoint rv1 rv2 NSM
count 14803.000 14803.000 14803.000 14803.000 14803.000 14803.000 14803.000 14803.000 14803.000 14803.000 ... 14803.000 14803.000 14803.000 14803.000 14803.000 14803.000 14803.000 14803.000 14803.000 14803.000
mean 98.011 3.803 21.684 40.271 20.343 40.418 22.263 39.249 20.855 39.030 ... 41.542 7.413 755.503 79.734 4.034 38.330 3.757 25.078 25.078 42985.989
std 102.828 7.940 1.608 3.982 2.192 4.066 2.014 3.253 2.045 4.340 ... 4.151 5.324 7.428 14.956 2.437 11.813 4.200 14.482 14.482 24968.649
min 10.000 0.000 16.790 27.023 16.100 20.893 17.200 28.767 15.100 27.660 ... 29.167 -5.000 729.300 24.000 0.000 1.000 -6.600 0.005 0.005 0.000
25% 50.000 0.000 20.730 37.362 18.823 37.900 20.790 36.900 19.500 35.530 ... 38.500 3.667 750.867 70.000 2.000 29.000 0.900 12.580 12.580 21600.000
50% 60.000 0.000 21.600 39.657 20.000 40.500 22.100 38.530 20.667 38.400 ... 40.863 6.900 756.100 83.667 3.667 40.000 3.450 25.044 25.044 43200.000
75% 100.000 0.000 22.600 43.090 21.500 43.290 23.290 41.762 22.100 42.130 ... 44.363 10.400 760.933 91.667 5.500 40.000 6.533 37.666 37.666 64800.000
max 1080.000 50.000 26.260 63.360 29.857 56.027 29.236 50.163 26.200 51.063 ... 53.327 25.967 772.300 100.000 13.500 66.000 15.500 49.997 49.997 85800.000

8 rows × 29 columns

Analysis


We explore the training dataset through its summary statistics.

In [11]:
test_data = pd.read_csv('data/energydata_testing.csv' ,index_col='date')
print ("Shape of the testing dataset is : ", test_data.shape)
test_data.head()
Shape of the testing dataset is :  (4932, 31)
Out[11]:
Appliances lights T1 RH_1 T2 RH_2 T3 RH_3 T4 RH_4 ... Press_mm_hg RH_out Windspeed Visibility Tdewpoint rv1 rv2 NSM WeekStatus Day_of_week
date
2016-01-11 17:30:00 50 40 19.890 46.067 19.200 44.590 19.790 45.000 18.890 45.723 ... 733.800 92.000 6.000 51.500 5.000 45.410 45.410 63000 Weekday Monday
2016-01-11 18:00:00 60 50 19.890 45.767 19.200 44.500 19.790 44.900 18.890 45.790 ... 734.100 92.000 5.000 40.000 4.700 47.234 47.234 64800 Weekday Monday
2016-01-11 18:40:00 230 70 19.927 45.863 19.357 44.400 19.790 44.900 18.890 46.430 ... 734.367 91.333 5.667 40.000 4.633 10.299 10.299 67200 Weekday Monday
2016-01-11 18:50:00 580 60 20.067 46.397 19.427 44.400 19.790 44.827 19.000 46.430 ... 734.433 91.167 5.833 40.000 4.617 8.828 8.828 67800 Weekday Monday
2016-01-11 19:30:00 100 10 20.567 53.893 20.033 46.757 20.100 48.467 19.000 48.490 ... 734.850 89.500 6.000 40.000 4.350 24.885 24.885 70200 Weekday Monday

5 rows × 31 columns

Analysis


We read the testing dataset; it has shape 4932 * 31 (i.e., 4932 rows and 31 columns).

In [12]:
test_data.describe()
Out[12]:
Appliances lights T1 RH_1 T2 RH_2 T3 RH_3 T4 RH_4 ... RH_9 T_out Press_mm_hg RH_out Windspeed Visibility Tdewpoint rv1 rv2 NSM
count 4932.000 4932.000 4932.000 4932.000 4932.000 4932.000 4932.000 4932.000 4932.000 4932.000 ... 4932.000 4932.000 4932.000 4932.000 4932.000 4932.000 4932.000 4932.000 4932.000 4932.000
mean 96.746 3.800 21.694 40.225 20.337 40.428 22.283 39.223 20.855 39.017 ... 41.583 7.408 755.581 79.799 4.056 38.333 3.772 24.718 24.718 42670.438
std 101.614 7.924 1.601 3.972 2.197 4.081 1.983 3.260 2.037 4.346 ... 4.154 5.299 7.314 14.738 2.494 11.742 4.178 14.540 14.540 24854.921
min 10.000 0.000 16.790 27.430 16.100 20.463 17.200 30.927 15.100 28.857 ... 29.167 -4.956 729.500 25.000 0.000 1.000 -6.500 0.014 0.014 0.000
25% 50.000 0.000 20.766 37.290 18.790 37.900 20.790 36.863 19.567 35.500 ... 38.626 3.700 751.050 70.833 2.000 29.000 0.913 12.339 12.339 21000.000
50% 60.000 0.000 21.600 39.660 20.000 40.500 22.100 38.530 20.600 38.400 ... 41.000 6.950 756.100 83.667 3.667 40.000 3.400 24.491 24.491 42000.000
75% 100.000 0.000 22.667 42.975 21.500 43.161 23.367 41.760 22.100 42.171 ... 44.290 10.417 760.900 91.500 5.500 40.000 6.683 37.375 37.375 64200.000
max 1070.000 70.000 26.260 56.393 29.667 54.767 29.100 50.090 26.200 51.090 ... 53.327 26.100 772.283 100.000 14.000 66.000 15.317 49.993 49.993 85800.000

8 rows × 29 columns

Analysis


We explore the testing dataset through its summary statistics.

In [13]:
train_data_corr =train_data [["Appliances", "lights","T1","RH_1","T2","RH_2","T3","RH_3"]]
sns.pairplot(train_data_corr)
Out[13]:
<seaborn.axisgrid.PairGrid at 0x1b00026a240>

Analysis


Plotting the first pairplot over the training-data columns Appliances, lights, T1, RH_1, T2, RH_2, T3 and RH_3. From this pairplot we can draw the following conclusions:

  1. There is a positive correlation between Appliances and lights.
  2. There is a positive correlation between T1 and T3.
In [14]:
train_data_corr=train_data_corr.corr()
train_data_corr
Out[14]:
Appliances lights T1 RH_1 T2 RH_2 T3 RH_3
Appliances 1.000 0.195 0.060 0.087 0.125 -0.061 0.093 0.037
lights 0.195 1.000 -0.028 0.113 -0.012 0.059 -0.099 0.135
T1 0.060 -0.028 1.000 0.167 0.838 0.001 0.893 -0.026
RH_1 0.087 0.113 0.167 1.000 0.273 0.798 0.257 0.845
T2 0.125 -0.012 0.838 0.273 1.000 -0.161 0.736 0.123
RH_2 -0.061 0.059 0.001 0.798 -0.161 1.000 0.142 0.681
T3 0.093 -0.099 0.893 0.257 0.736 0.142 1.000 -0.008
RH_3 0.037 0.135 -0.026 0.845 0.123 0.681 -0.008 1.000

Analysis


The code above shows the correlation matrix between these variables. It conveys the same information as the pairplot; the pairplot visualises the correlations as plots, whereas the correlation matrix presents them as numbers.

In [15]:
complete_data['day_name'] = complete_data['date'].dt.day_name()
complete_data['hour'] = complete_data['date'].dt.hour

Analysis


We create two new columns, assigning the weekday name to day_name and the hour of day to hour.

In [16]:
Firstmon = complete_data.loc[(complete_data.date >= '2016-01-01') & (complete_data.date <= '2016-01-28')]
Secondmon = complete_data.loc[(complete_data.date >= '2016-02-01') & (complete_data.date <= '2016-02-28')]
Thirdmon = complete_data.loc[(complete_data.date >= '2016-03-01') & (complete_data.date <= '2016-03-28')]
Fourthmon = complete_data.loc[(complete_data.date >= '2016-04-01') & (complete_data.date <= '2016-04-28')]

Analysis


We split the data into the first four months of 2016 (January to April), using the 28th as a common upper bound for each month.
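As a hedged alternative sketch (on a toy frame, not complete_data), selecting by calendar month via the .dt.month accessor avoids hard-coding date ranges and keeps the days after the 28th:

```python
import pandas as pd

# Toy frame mimicking complete_data's date column; values are made up
df = pd.DataFrame({
    'date': pd.to_datetime(['2016-01-15', '2016-01-31',
                            '2016-02-10', '2016-03-29']),
    'Appliances': [60, 50, 100, 70],
})

# Selecting by calendar month keeps days 29-31, which fixed
# '...-28' upper bounds would drop
jan = df[df['date'].dt.month == 1]
mar = df[df['date'].dt.month == 3]
print(len(jan), len(mar))  # → 2 1
```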

In [17]:
Firstmonv2  = pd.pivot_table(Firstmon[['day_name','hour','Appliances']],index=['day_name','hour'],aggfunc='sum')
Secondmonv2 = pd.pivot_table(Secondmon[['day_name','hour','Appliances']],index=['day_name','hour'],aggfunc='sum')
Thirdmonv2  = pd.pivot_table(Thirdmon[['day_name','hour','Appliances']],index=['day_name','hour'],aggfunc='sum')
Fourthmonv2 = pd.pivot_table(Fourthmon[['day_name','hour','Appliances']],index=['day_name','hour'],aggfunc='sum')

Analysis


We create a pivot table for each month, summing Appliances over day_name and hour.
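A minimal illustration of the same pivot_table call on toy data (column names match the notebook, values are made up):

```python
import pandas as pd

# Minimal stand-in for one month of readings (made-up values)
df = pd.DataFrame({
    'day_name': ['Monday', 'Monday', 'Tuesday'],
    'hour': [17, 17, 17],
    'Appliances': [60, 50, 100],
})

# Same call pattern as above: total consumption per (day, hour) cell
pivot = pd.pivot_table(df[['day_name', 'hour', 'Appliances']],
                       index=['day_name', 'hour'], aggfunc='sum')
print(pivot.loc[('Monday', 17), 'Appliances'])  # → 110
```

Each (day_name, hour) pair becomes one row of the MultiIndex, holding the summed Appliances value for that cell.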

In [18]:
Firstmonv3  =Firstmonv2.unstack(level=0)
Secondmonv3 =Secondmonv2.unstack(level=0)
Thirdmonv3  =Thirdmonv2.unstack(level=0)
Fourthmonv3 =Fourthmonv2.unstack(level=0)

Analysis


The code above unstacks each pivot table, moving day_name from the index into the columns.

In [19]:
Firstmonv3  = Firstmonv3.reindex(labels=['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'],axis=1,level=1)
Secondmonv3 = Secondmonv3.reindex(labels=['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'],axis=1,level=1)
Thirdmonv3  = Thirdmonv3.reindex(labels=['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'],axis=1,level=1)
Fourthmonv3 = Fourthmonv3.reindex(labels=['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'],axis=1,level=1)

Analysis


We reorder the column labels so the weekdays appear in order, starting from Sunday.

In [20]:
day_short_names = ['Sun','Mon','Tues','Wed','Thurs','Fri','Sat']

Analysis


We assign shorter names for the weekdays to use as axis labels.

In [21]:
f,ax =plt.subplots(figsize=(5,15))
ax=sns.heatmap(Firstmonv3,cmap="YlGnBu",linewidths=.5, xticklabels=day_short_names)
ax.set(xlabel = 'Day of the Week',ylabel='Hour of Day')
Out[21]:
[Text(25.0, 0.5, 'Hour of Day'), Text(0.5, 115.0, 'Day of the Week')]

Analysis


1st month heatmap

The code above creates a heatmap for the first month, depicting the daily trend of energy consumption. We can conclude that energy consumption is highest during the evening hours.

In [22]:
f,ax =plt.subplots(figsize=(5,15))
ax=sns.heatmap(Secondmonv3,cmap="BuPu",linewidths=.5, xticklabels=day_short_names)
ax.set(xlabel = 'Day of the Week',ylabel='Hour of Day')
Out[22]:
[Text(25.0, 0.5, 'Hour of Day'), Text(0.5, 115.0, 'Day of the Week')]

Analysis


2nd month heatmap

The code above creates a heatmap for the second month, depicting the daily trend of energy consumption. We can conclude that energy consumption is highest around 1800 hours.

In [23]:
f,ax =plt.subplots(figsize=(5,15))
ax=sns.heatmap(Thirdmonv3,cmap="Blues",linewidths=.5, xticklabels=day_short_names)
ax.set(xlabel = 'Day of the Week',ylabel='Hour of Day')
Out[23]:
[Text(25.0, 0.5, 'Hour of Day'), Text(0.5, 115.0, 'Day of the Week')]

Analysis


3rd month heatmap

The code above creates a heatmap for the third month, depicting the daily trend of energy consumption. We can conclude that energy consumption is highest between 1000 and 1100 hours, and again around 1800 hours, throughout the week.

In [24]:
f,ax =plt.subplots(figsize=(5,15))
ax=sns.heatmap(Fourthmonv3,cmap="Greens",linewidths=.5, xticklabels=day_short_names)
ax.set(xlabel = 'Day of the Week',ylabel='Hour of Day')
Out[24]:
[Text(25.0, 0.5, 'Hour of Day'), Text(0.5, 115.0, 'Day of the Week')]

Analysis


4th month heatmap

The code above creates a heatmap for the fourth month, depicting the daily trend of energy consumption. We can conclude that energy consumption is highest around 1000 hours on Friday and 1700 hours on Tuesday.

In [25]:
train_data_corr2 =train_data [["Appliances", "T4", "RH_4","T5", "RH_5", "T6", "RH_6"]]
sns.pairplot(train_data_corr2,diag_kind = 'kde',plot_kws = {'alpha': 0.4, 's': 80, 'edgecolor': 'y'})
Out[25]:
<seaborn.axisgrid.PairGrid at 0x1b005187470>

Analysis


Plotting the second pairplot over the training-data columns Appliances, T4, RH_4, T5, RH_5, T6 and RH_6. We can draw the following conclusions from this plot:

  1. Among these variables, the outdoor temperature (T6) has the strongest, though still weak, positive correlation with Appliances.
  2. There is a weak negative correlation between Appliances and the outdoor humidity (RH_6).
In [26]:
train_data_corr2.corr()
Out[26]:
Appliances T4 RH_4 T5 RH_5 T6 RH_6
Appliances 1.000 0.047 0.016 0.023 0.006 0.122 -0.087
T4 0.047 1.000 -0.045 0.871 -0.077 0.656 -0.702
RH_4 0.016 -0.045 1.000 0.093 0.353 0.260 0.391
T5 0.023 0.871 0.093 1.000 0.033 0.630 -0.631
RH_5 0.006 -0.077 0.353 0.033 1.000 -0.081 0.267
T6 0.122 0.656 0.260 0.630 -0.081 1.000 -0.672
RH_6 -0.087 -0.702 0.391 -0.631 0.267 -0.672 1.000

Analysis


As before, this correlation matrix presents numerically the relationships that the pairplot shows graphically.

In [27]:
train_data_corr3 =train_data [["Appliances", "T7", "RH_7","T8", "RH_8", "T9","RH_9"]]
sns.pairplot(train_data_corr3,diag_kind = 'kde',plot_kws = {'alpha': 0.2, 's': 50, 'edgecolor': 'k'})
Out[27]:
<seaborn.axisgrid.PairGrid at 0x1b007b43c50>

Analysis

The pairplot shows the correlations between these variables in a more digestible form. T7 has low correlation with RH_7, RH_8 and RH_9, and Appliances likewise has low correlation with RH_7, RH_8 and RH_9.

In [28]:
train_data_corr3.corr()
Out[28]:
Appliances T7 RH_7 T8 RH_8 T9 RH_9
Appliances 1.000 0.032 -0.056 0.046 -0.094 0.016 -0.050
T7 0.032 1.000 -0.034 0.882 -0.211 0.944 -0.076
RH_7 -0.056 -0.034 1.000 -0.121 0.885 0.029 0.859
T8 0.046 0.882 -0.121 1.000 -0.209 0.869 -0.154
RH_8 -0.094 -0.211 0.885 -0.209 1.000 -0.114 0.856
T9 0.016 0.944 0.029 0.869 -0.114 1.000 -0.007
RH_9 -0.050 -0.076 0.859 -0.154 0.856 -0.007 1.000

Analysis


As before, this correlation matrix presents numerically the relationships that the pairplot shows graphically.

In [29]:
train_data_corr4 =train_data [["Appliances", "T_out", "Press_mm_hg", "RH_out", "Windspeed","Visibility", "Tdewpoint", "NSM","T6"]]
sns.pairplot(train_data_corr4,diag_kind = 'kde',plot_kws = {'alpha': 0.2, 's': 100, 'edgecolor': 'b'})
Out[29]:
<seaborn.axisgrid.PairGrid at 0x1b00b874a90>

Analysis


The pairplot shows the correlations between these variables. Among them, Appliances is most strongly correlated with NSM (though the correlation is still modest) and most weakly correlated with Visibility.

In [30]:
train_data_corr4.corr()
Out[30]:
Appliances T_out Press_mm_hg RH_out Windspeed Visibility Tdewpoint NSM T6
Appliances 1.000 0.104 -0.032 -0.155 0.085 -0.005 0.020 0.216 0.122
T_out 0.104 1.000 -0.137 -0.574 0.194 -0.077 0.789 0.222 0.975
Press_mm_hg -0.032 -0.137 1.000 -0.099 -0.231 0.041 -0.241 -0.004 -0.135
RH_out -0.155 -0.574 -0.099 1.000 -0.174 0.087 0.039 -0.344 -0.568
Windspeed 0.085 0.194 -0.231 -0.174 1.000 -0.004 0.128 0.100 0.169
Visibility -0.005 -0.077 0.041 0.087 -0.004 1.000 -0.039 -0.027 -0.082
Tdewpoint 0.020 0.789 -0.241 0.039 0.128 -0.039 1.000 0.029 0.764
NSM 0.216 0.222 -0.004 -0.344 0.100 -0.027 0.029 1.000 0.205
T6 0.122 0.975 -0.135 -0.568 0.169 -0.082 0.764 0.205 1.000
In [31]:
train_data_heatmap =plt.subplots(figsize=(10,10))
ax=sns.heatmap(train_data_corr4.corr(),cmap="Blues",linewidths=1.5,annot=True)

Analysis


Visualising the correlations through an annotated heatmap is another good approach.

In [32]:
sns.distplot(train_data['Appliances'])
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x1b011155f98>

Analysis


Plotting a distplot for Appliances to examine the spread of the data. From this distplot, we can conclude that the distribution is skewed to the right.
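The right skew can also be checked numerically with pandas' skew(); a toy series standing in for Appliances gives a clearly positive value:

```python
import pandas as pd

# Toy series standing in for the Appliances column (made-up values)
s = pd.Series([50, 60, 60, 50, 100, 60, 50, 1080])
print(s.skew())  # positive, confirming the right skew seen in the distplot
```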

Training a model on Linear Regression


Let's begin training the model.

We first split the data into two arrays, X and y: X holds the predictor variables used for training and y holds the target whose value the model will predict.

We remove two columns, WeekStatus and Day_of_week, because both store categorical values that the linear regression model cannot use directly.
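One compact way to sketch this split (on a toy frame; the notebook below lists the columns explicitly) is to drop the target and the two categorical columns:

```python
import pandas as pd

# Toy frame with the same kinds of columns as train_data (values made up)
df = pd.DataFrame({
    'Appliances': [60, 50], 'lights': [30, 30], 'T1': [19.89, 19.89],
    'WeekStatus': ['Weekday', 'Weekday'], 'Day_of_week': ['Monday', 'Monday'],
})

# Drop the target plus the two categorical columns; everything else is X
X = df.drop(columns=['Appliances', 'WeekStatus', 'Day_of_week'])
y = df['Appliances']
print(X.shape, y.shape)  # → (2, 2) (2,)
```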

In [33]:
train_data.columns
Out[33]:
Index(['Appliances', 'lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3', 'RH_3', 'T4',
       'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8', 'RH_8', 'T9',
       'RH_9', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed', 'Visibility',
       'Tdewpoint', 'rv1', 'rv2', 'NSM', 'WeekStatus', 'Day_of_week'],
      dtype='object')

Analysis


We are checking the columns in the training dataset.

In [34]:
X_train = train_data[['lights', 
                             'T1', 'RH_1', 'T2', 'RH_2', 'T3',
       'RH_3', 'T4', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8',
       'RH_8', 'T9', 'RH_9', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed',
       'Visibility', 'Tdewpoint', 'rv1', 'rv2','NSM']]
print (X_train.shape)
y_train = train_data['Appliances']
print (y_train.shape)
(14803, 28)
(14803,)

Analysis


The X array has only 28 columns because the 2 categorical variables have been dropped. We print the shapes of the X and y arrays.

In [35]:
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression()
linear_model.fit(X_train,y_train)
Out[35]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Analysis


We fit the linear regression model to the training data.

In [36]:
print(linear_model.intercept_)
print(linear_model.coef_)
-40.03362523484286
[ 1.86660516e+00 -4.14497987e+00  1.44790183e+01 -1.78426597e+01
 -1.37115412e+01  2.76620233e+01  5.34759535e+00 -2.51528298e+00
 -7.94526199e-01 -1.50241017e+00  9.21698362e-02  7.37898919e+00
  3.26591065e-01  2.00333400e+00 -1.74131349e+00  8.10390686e+00
 -3.64205959e+00 -1.33516314e+01 -3.25454078e-01 -1.01102847e+01
  1.86447658e-01 -9.03894716e-01  1.82733594e+00  1.33901648e-01
  4.19325863e+00 -2.45927477e-02 -2.45927477e-02  2.94814485e-04]

Analysis


We print the intercept and the coefficients of the fitted model.

In [37]:
coeff_df = pd.DataFrame(linear_model.coef_, X_train.columns,columns=['Coefficient'])
coeff_df
Out[37]:
Coefficient
lights 1.867
T1 -4.145
RH_1 14.479
T2 -17.843
RH_2 -13.712
T3 27.662
RH_3 5.348
T4 -2.515
RH_4 -0.795
T5 -1.502
RH_5 0.092
T6 7.379
RH_6 0.327
T7 2.003
RH_7 -1.741
T8 8.104
RH_8 -3.642
T9 -13.352
RH_9 -0.325
T_out -10.110
Press_mm_hg 0.186
RH_out -0.904
Windspeed 1.827
Visibility 0.134
Tdewpoint 4.193
rv1 -0.025
rv2 -0.025
NSM 0.000

Analysis


The code above prints the coefficient of each variable; T3 has the largest model coefficient.

In [38]:
predictions_df = linear_model.predict(X_train)
In [39]:
comparison_df= pd.DataFrame({"Appliances Actual Value": y_train, "Appliances Predicted Value": predictions_df})
comparison_df.head()
Out[39]:
Appliances Actual Value Appliances Predicted Value
date
2016-01-11 17:00:00 60 164.094
2016-01-11 17:10:00 60 149.799
2016-01-11 17:20:00 50 145.082
2016-01-11 17:40:00 60 166.047
2016-01-11 17:50:00 50 159.361

Analysis


Printing the actual and predicted Appliances values.

Testing on the Linear Regression Model

In [40]:
X_test = test_data[['lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3',
       'RH_3', 'T4', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8',
       'RH_8', 'T9', 'RH_9', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed',
       'Visibility', 'Tdewpoint', 'rv1', 'rv2','NSM']]
print (X_test.shape)
y_test = test_data['Appliances']
print (y_test.shape)
(4932, 28)
(4932,)

Analysis


We split the testing data the same way: the X array holds the predictor variables and y holds the target to be predicted.

As before, the WeekStatus and Day_of_week columns are removed because the linear regression model cannot use categorical values directly.

In [41]:
predictions = linear_model.predict(X_test)
mse = ((y_test - linear_model.predict(X_test))**2).mean()
print("RMSE:", np.sqrt(mse))
RMSE: 93.56425120887658

Analysis


RMSE: the Root Mean Squared Error is a quadratic scoring rule that measures the average magnitude of the error. It is the square root of the average of the squared differences between predictions and actual observations.

In [42]:
from sklearn.metrics import mean_squared_error, r2_score
print("MSE:", mean_squared_error(y_test, predictions))
print("R^2:", r2_score(y_test, predictions))
MSE: 8754.269104277759
R^2: 0.15199183390308568

Analysis


MSE: the Mean Squared Error of an estimator measures the average of the squared errors, i.e. the average squared difference between the estimated values and the true values.

R-squared (R²) is a statistical measure that represents the proportion of the variance of the dependent variable explained by the independent variable(s) in a regression model.
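These three metrics can be reproduced from their definitions; the sketch below uses five sample actual/predicted pairs (taken, approximately, from this notebook's test-set comparison table):

```python
import numpy as np

# Five actual/predicted Appliances pairs used as sample inputs
y_true = np.array([50.0, 60.0, 230.0, 580.0, 100.0])
y_pred = np.array([159.5, 173.7, 211.5, 197.0, 189.8])

mse_check = np.mean((y_true - y_pred) ** 2)    # average squared error
rmse_check = np.sqrt(mse_check)                # same units as target (Wh)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_check = 1 - ss_res / ss_tot                 # proportion of variance explained
print(round(rmse_check, 1), round(r2_check, 3))  # → 189.7 0.088
```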

In [43]:
comparison_test_df= pd.DataFrame({"Actual Appliances' values": y_test, "Predicted Appliances' values": predictions})
comparison_test_df.head()
Out[43]:
Actual Appliances' values Predicted Appliances' values
date
2016-01-11 17:30:00 50 159.463
2016-01-11 18:00:00 60 173.740
2016-01-11 18:40:00 230 211.456
2016-01-11 18:50:00 580 196.957
2016-01-11 19:30:00 100 189.807
In [44]:
scatterplot = plt.subplots(figsize=(10,10))
ax = plt.scatter(test_data['Appliances'], y_test - predictions)
plt.xlim([-100,1200])
plt.xlabel("Appliances Energy Consumption", size = 10)
plt.ylim(-300,1400)
plt.ylabel("Residuals", size = 10)
Out[44]:
Text(0, 0.5, 'Residuals')

Analysis


From the above plot, we can conclude that the variables used in the linear model are not good enough for estimation, because the residuals are not evenly distributed around the horizontal line.

In [45]:
from sklearn.feature_selection import RFE
linear_model = LinearRegression()
rfe = RFE(linear_model, 2)
X_rfe = rfe.fit_transform(X_train,y_train)  
linear_model.fit(X_rfe,y_train)
print(rfe.support_)
print(rfe.ranking_)
[False False  True False  True False False False False False False False
 False False False False False False False False False False False False
 False False False False]
[11 12  1  2  1  3  7 10 21 17 23  8 20 15 14  5  6  4 18  9 22 19 13 24
 16 25 26 27]

Analysis


From the above analysis, we can say that the Appliances value was predicted using all the independent columns in the data, and the MSE was high.

We are going to use the RFE (Recursive Feature Elimination) method to make predictions based on selected variables only.

In [46]:
nof_list=np.arange(1,28)            
high_score=0
nof=0           
score_list =[]
for n in range(len(nof_list)):
    model = LinearRegression()
    rfe = RFE(model,nof_list[n])
    X_train_rfe = rfe.fit_transform(X_train,y_train)
    X_test_rfe = rfe.transform(X_test)
    model.fit(X_train_rfe,y_train)
    score = model.score(X_test_rfe,y_test)
    score_list.append(score)
    if(score>high_score):
        high_score = score
        nof = nof_list[n]
print("Optimum number of features: %d" %nof)
print("Score with %d features: %f" % (nof, high_score))
Optimum number of features: 25
Score with 25 features: 0.149535

Analysis


After implementing RFE, 25 variables were selected based on their coefficients.

In [47]:
cols = list(X_train.columns)
model = LinearRegression()
rfe = RFE(model, 25)             
X_rfe = rfe.fit_transform(X_train,y_train)  
model.fit(X_rfe,y_train)              
temp = pd.Series(rfe.support_,index = cols)
selected_features_rfe = temp[temp==True].index
print(selected_features_rfe)
Index(['lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3', 'RH_3', 'T4', 'RH_4', 'T5',
       'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8', 'RH_8', 'T9', 'RH_9', 'T_out',
       'Press_mm_hg', 'RH_out', 'Windspeed', 'Visibility', 'Tdewpoint'],
      dtype='object')

Analysis


Above code prints the variables selected by the RFE method.

In [48]:
X_train = train_data[['lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3', 'RH_3', 'T4', 'RH_4', 'T5',
       'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8', 'RH_8', 'T9', 'RH_9', 'T_out',
       'Press_mm_hg', 'RH_out', 'Windspeed', 'Visibility', 'Tdewpoint']]
print (X_train.shape)
(14803, 25)

Analysis


We keep only the RFE-selected columns from the training dataset.

In [49]:
linear_model_new = LinearRegression()
linear_model_new.fit(X_train,y_train)
Out[49]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [50]:
X_test = test_data[['lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3', 'RH_3', 'T4', 'RH_4', 'T5',
       'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8', 'RH_8', 'T9', 'RH_9', 'T_out',
       'Press_mm_hg', 'RH_out', 'Windspeed', 'Visibility', 'Tdewpoint']]
print (X_test.shape)
(4932, 25)
In [51]:
predictions = linear_model_new.predict(X_test)
In [52]:
mse = ((y_test - linear_model_new.predict(X_test))**2).mean()
print("RMSE:", np.sqrt(mse))
RMSE: 93.69971138183963
In [53]:
print("MSE:", mean_squared_error(y_test, predictions))
print("R^2:", r2_score(y_test, predictions))
MSE: 8779.635913040014
R^2: 0.14953460295416499
In [54]:
compare_df_rfe= pd.DataFrame({"Actual Appliances' values": y_test, "Predicted Appliances' values": predictions})
compare_df_rfe.head()
Out[54]:
Actual Appliances' values Predicted Appliances' values
date
2016-01-11 17:30:00 50 158.324
2016-01-11 18:00:00 60 172.637
2016-01-11 18:40:00 230 209.879
2016-01-11 18:50:00 580 195.107
2016-01-11 19:30:00 100 191.219

Analysis


Printing the actual and predicted Appliances values.

In [55]:
scatterplot = plt.subplots(figsize=(10,10))
x = plt.scatter(test_data['Appliances'], y_test - predictions)
plt.xlim([-100,1200])
plt.xlabel("Appliances Energy Consumption", size = 10)
plt.ylim(-300,1400)
plt.ylabel("Residuals", size = 10)
Out[55]:
Text(0, 0.5, 'Residuals')

Conclusion


We can see that the Mean Squared Error still has quite a high value even after using the Recursive Feature Elimination method.

A good model depends on the variables used in it. The model in this portfolio could be made better by using a more informative set of predictor variables.

Portfolio 3 - Clustering Visualisation

K-means clustering is one of the simplest and popular unsupervised learning algorithms. Typically, unsupervised algorithms make inferences from datasets using only input vectors without referring to known, or labelled, outcomes. This notebook illustrates the process of K-means clustering by generating some random clusters of data and then showing the iterations of the algorithm as random cluster means are updated.

We first generate random data around 4 centers.

In [1]:
import numpy as np 
import pandas as pd 
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
%matplotlib inline
In [2]:
center_1 = np.array([1,2])
center_2 = np.array([6,6])
center_3 = np.array([9,1])
center_4 = np.array([-5,-1])

#Generate random data and center it to the four centers each with a different variance
np.random.seed(5)
data_1 = np.random.randn(200,2) * 1.5 + center_1
data_2 = np.random.randn(200,2) * 1 + center_2
data_3 = np.random.randn(200,2) * 0.5 + center_3
data_4 = np.random.randn(200,2) * 0.8 + center_4

data = np.concatenate((data_1, data_2, data_3, data_4), axis = 0)

plt.scatter(data[:,0], data[:,1], s=7, c='k')
plt.show()
In [3]:
print("Shape of the dataset is :", data.shape)
data
Shape of the dataset is : (800, 2)
Out[3]:
array([[ 1.66184123,  1.50369477],
       [ 4.64615678,  1.62186181],
       [ 1.16441476,  4.37372168],
       ...,
       [-5.94762563,  0.05925507],
       [-5.5282781 , -0.16683908],
       [-5.02162618, -0.15647292]])

Analysis


The code above shows the shape of the dataset for which the centroids will be calculated: the data is 2-dimensional with 800 points. We also print the data itself.

1. Generate random cluster centres

You need to generate four random centres.

This part of portfolio should contain at least:

  • The number of clusters k is set to 4;
  • Generate random centres via centres = np.random.randn(k,c)*std + mean where std and mean are the standard deviation and mean of the data. c represents the number of features in the data. Set the random seed to 6.
  • Color the generated centers with green, blue, yellow, and cyan. Set the edgecolors to red.

In the following cells, the number of clusters is defined (k=4) and the mean and standard deviation of the data are calculated.

In [4]:
k=4
mean=data.mean()
std=np.std(data)
print ("The mean for the normalised data is:",mean)
print ("The standard deviation for the normalised data is:",std)
The mean for the normalised data is: 2.3897478180960845
The standard deviation for the normalised data is: 4.319440166126304
In [5]:
np.random.seed(6)
centres = np.random.randn(k,2)*std + mean #centres variable stores value of the calculated centroids
print(centres)
[[ 1.0430169   5.53863665]
 [ 3.33061168 -1.4938254 ]
 [-8.35175241  6.33448312]
 [ 7.25803215 -4.15028729]]

Analysis


A random seed has been set and the number of clusters chosen as 4 (k=4). Centres are generated with the formula given in the description above (np.random.randn(k,c)*std + mean, where std is the standard deviation, a measure of how much the data values vary around the average, and mean is the average of the values).

In [6]:
def gen_random_centres():
    plt.scatter(data[:,0], data[:,1], s=6, c='k')
    color_coding=['g','b','y','c']
    for i in range(4):
        color=color_coding[i]
        plt.scatter(centres[i][0],centres[i][1],marker='*', s=400, c=color,edgecolor='r')
gen_random_centres()

Analysis


The function gen_random_centres defined above plots the centroids over the data, using the colour coding of green, blue, yellow and cyan with a red edge colour.

The for loop runs 4 times because there are 4 centroids to plot.

2. Visualise the clustering results in each iteration

Kmeans Algorithm

Kmeans clustering is an unsupervised machine learning algorithm used when we have data which has no labels, groups and categories; our aim is to search for the groups within the data with the value defined for k.

Basic steps involve in the Kmeans algorthim :

  1. Begins with K randomly placed centroids.
  2. We will assign each data point to the nearest centroid.
  3. Re calculate the centroids for each set of points(clusters) , assigning each data point to the closest one.
  4. If the centroid estimate has not changed significantly, terminate the process, otherwise repeat from step 2.
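The four steps above can be sketched in a few lines of NumPy (a toy 2-D dataset and k=2 are assumed here, not the notebook's generated data):

```python
import numpy as np

# Toy 2-D data: two obvious clusters, with k=2 starting centres
toy_data = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
toy_centres = np.array([[0.5, 5.0], [10.5, 5.0]])

for _ in range(10):                                    # repeat steps 2-4
    # step 2: assign each point to its nearest centre
    dists = np.linalg.norm(toy_data[:, None] - toy_centres[None, :], axis=2)
    labels = dists.argmin(axis=1)
    # step 3: move each centre to the mean of its assigned points
    new = np.array([toy_data[labels == j].mean(axis=0) for j in range(2)])
    if np.allclose(new, toy_centres):                  # step 4: converged
        break
    toy_centres = new

print(toy_centres)  # each centre ends up in the middle of its cluster
```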
In [7]:
clusters=np.zeros(len(data))
clusters.shape
Out[7]:
(800,)
In [8]:
def euc_dist(point, centroid, ax=1):
    return np.linalg.norm(point - centroid, axis=ax) 

Analysis


A function euc_dist has been defined to calculate the Euclidean distance.

In [9]:
def cluster_search():
    for i in range(len(data)):    
        distances = euc_dist(data[i],centres)
        cluster = np.argmin(distances)
        clusters[i] = cluster

Analysis


A function cluster_search has been defined to assign each data point to its nearest centroid.

In [10]:
def centres_search(): 
    for d in range(k):
        pts=[data[j] for j in range(len(data)) if clusters[j]==d]
        centres[d]=np.mean(pts,axis=0)

Analysis


A function named centres_search has been defined to compute the new centres as the mean of the points belonging to each cluster.

In [11]:
cluster_search()
print(clusters)
[1. 1. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0.
 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 1. 1. 0. 1. 0. 0. 0. 1. 1. 1. 0.
 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 1. 0. 1. 0. 1. 1. 0. 0. 0. 0. 1.
 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 1. 1. 0. 0. 1. 0. 0.
 0. 0. 1. 0. 1. 1. 0. 0. 0. 1. 0. 1. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1.
 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0.
 0. 1. 1. 1. 1. 1. 0. 0. 0. 1. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 1. 1. 0. 0.
 0. 1. 1. 1. 0. 0. 0. 0. 1. 0. 1. 1. 1. 0. 0. 1. 0. 1. 1. 1. 0. 1. 0. 0.
 1. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 3. 3. 3. 3. 3. 3. 3. 3.
 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 1. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
 3. 3. 3. 3. 3. 3. 3. 1. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
 3. 3. 3. 3. 3. 3. 3. 1. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 1. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 1. 3. 3. 3. 3. 3.
 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
 2. 1. 2. 1. 1. 2. 1. 2. 2. 2. 2. 2. 1. 1. 2. 1. 2. 2. 1. 1. 2. 1. 1. 2.
 2. 1. 2. 1. 2. 1. 2. 2. 1. 2. 1. 2. 1. 1. 2. 1. 1. 2. 1. 2. 1. 2. 2. 2.
 2. 2. 2. 1. 2. 2. 1. 2. 1. 2. 1. 2. 1. 2. 1. 1. 2. 2. 2. 2. 1. 2. 2. 2.
 2. 2. 1. 2. 1. 2. 2. 2. 1. 2. 1. 1. 2. 2. 1. 2. 2. 2. 2. 2. 2. 1. 2. 2.
 2. 2. 2. 1. 2. 2. 2. 2. 1. 2. 2. 2. 2. 1. 2. 2. 1. 1. 1. 1. 2. 2. 2. 1.
 2. 2. 2. 2. 1. 2. 2. 2. 1. 2. 2. 1. 2. 1. 2. 1. 1. 1. 2. 1. 2. 2. 2. 1.
 1. 1. 2. 2. 2. 1. 2. 2. 2. 2. 2. 2. 1. 2. 1. 2. 1. 1. 1. 2. 2. 2. 1. 2.
 1. 1. 2. 2. 1. 2. 1. 2. 2. 1. 2. 1. 1. 1. 2. 2. 1. 1. 1. 2. 2. 1. 2. 1.
 2. 2. 2. 1. 1. 2. 2. 2.]

Analysis


Calling the aforementioned function cluster_search and printing its output.

In [12]:
centres_search()
print(centres)
gen_random_centres()
[[ 3.99620775  4.82013001]
 [-1.08109426 -0.14715591]
 [-5.54024032 -0.68960389]
 [ 9.09422694  1.02197195]]

Analysis

Printing the new centres as an array.

We call cluster_search and centres_search repeatedly to obtain updated centres and plot them with gen_random_centres.

In [13]:
cluster_search()
centres_search()
print(centres)
gen_random_centres()
[[ 4.90132371  5.15030358]
 [ 0.24633035  1.5433697 ]
 [-5.0796739  -0.95753889]
 [ 9.05236058  1.0794625 ]]

Analysis


The code above shows the 2nd iteration of updating the centroids; gen_random_centres() plots the updated centres.

In [14]:
cluster_search()
centres_search()
print(centres)
gen_random_centres()
[[ 5.76438788  5.71757292]
 [ 0.77137496  2.03200622]
 [-5.05717014 -0.94967031]
 [ 9.063639    1.04633907]]

Analysis


The code above shows the 3rd iteration of updating the centroids; gen_random_centres() plots the updated centres.

In [15]:
cluster_search()
centres_search()
print(centres)
gen_random_centres()
[[ 5.97243759  5.88784838]
 [ 0.93973117  2.13879788]
 [-5.05717014 -0.94967031]
 [ 9.063639    1.04633907]]

Analysis


The code above shows the 4th iteration of updating the centroids; gen_random_centres() plots the updated centres.

In [16]:
cluster_search()
centres_search()
print(centres)
gen_random_centres()
[[ 6.0053298   5.92528865]
 [ 0.98297825  2.15787959]
 [-5.05717014 -0.94967031]
 [ 9.063639    1.04633907]]

Analysis


The code above shows the 5th iteration of updating the centroids; gen_random_centres() plots the updated centres.

Conclusion

We can see that in the last iteration there is no significant change in the centroids, so we can halt the process here and take the plot above as the final clustering.